Parsing
Core Concept
Parsing is the task of predicting a syntactic structure over a sentence—typically a tree that represents constituency (phrase structure) or dependency (head–dependent relations). The output is a rooted tree whose nodes or arcs are labeled (e.g. NP, VP for constituents; subject, object for dependency relations). Valid outputs must satisfy grammatical constraints (e.g. well-formed brackets, a single head per word), so parsing is both a structured prediction problem and a constrained decoding problem. Models are trained on treebanks (sentences annotated with gold trees) and evaluated by how well predicted trees match the gold trees (attachment scores for dependencies, bracket F1 for constituents). Parsing supports downstream tasks such as information extraction, question answering, and machine translation.
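The "single head per word" constraint mentioned above can be checked mechanically. A minimal sketch (the function name and head-array encoding are illustrative, not from any particular library): each word stores the index of its head, index 0 denotes an artificial ROOT, and a valid dependency tree must have exactly one word attached to ROOT and no cycles.

```python
def is_valid_dependency_tree(heads):
    """heads[i-1] is the head index of word i (1-based); 0 is the artificial ROOT."""
    n = len(heads)
    # Exactly one word may attach directly to ROOT.
    if sum(1 for h in heads if h == 0) != 1:
        return False
    # Every word must reach ROOT by following head links, without revisiting a node.
    for i in range(1, n + 1):
        seen = set()
        node = i
        while node != 0:
            if node in seen:
                return False  # cycle detected
            seen.add(node)
            node = heads[node - 1]
    return True

# "She saw him": "saw" (word 2) heads both "She" and "him", and attaches to ROOT.
print(is_valid_dependency_tree([2, 0, 2]))  # True
print(is_valid_dependency_tree([3, 0, 1]))  # False: words 1 and 3 form a cycle
```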
Key Characteristics
- Tree-shaped output – Output is a tree (constituency: nested phrases; dependency: directed arcs from heads to dependents). The space of valid trees is defined by a grammar or dependency rules, so decoding must produce only valid structures.
- Global structure – Decisions are globally constrained (e.g. every word has exactly one head in dependency parsing; brackets must nest in constituency). Local classifiers or scorers are combined with dynamic programming or transition systems to enforce validity.
- Two main formalisms – Constituency parsing produces phrase-structure trees (e.g. NP, VP, S); dependency parsing produces trees of head–dependent arcs, one head per word (except root). Conversion between them is possible but not trivial.
- Decoding – Constituency: CKY or chart parsing with a context-free grammar; dependency: dynamic programming (e.g. Eisner's algorithm) for projective trees, maximum-spanning-tree algorithms (Chu–Liu/Edmonds) for non-projective trees, or transition-based systems (shift-reduce, arc-standard) that build the tree incrementally. Neural parsers often use transition-based decoding with learned policies, searching greedily or with a beam.
Common Applications
- Constituency (phrase-structure) parsing – Producing nested phrase brackets and nonterminal labels (e.g. Penn Treebank style); used in grammar checking, extraction of phrase-level semantics.
- Dependency parsing – Producing head–dependent arcs and relation types; used in many NLP pipelines (relation extraction, semantic role labeling, MT).
- Semantic parsing – Mapping sentences to meaning representations (e.g. logical form, AMR); output may be a tree or graph.
- Code parsing – Producing ASTs (abstract syntax trees) for source code; supports refactoring, analysis, and code generation.
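The code-parsing case above is directly observable with Python's standard-library `ast` module, which parses source text into an abstract syntax tree that analysis tools walk instead of inspecting raw text:

```python
import ast

# Parse a small function definition into an abstract syntax tree.
source = "def add(a, b):\n    return a + b"
tree = ast.parse(source)

# ast.walk traverses the tree breadth-first; listing node types shows the
# recovered structure (module -> function -> arguments/body).
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types[:4])  # ['Module', 'FunctionDef', 'arguments', 'Return']
```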